General problem setting

Reinforcement learning is one of the three main paradigms of Machine Learning, alongside supervised and unsupervised learning. In this paradigm one considers a learning agent interacting with an environment. The agent acts repeatedly in the environment and after each action it receives a reward from it, so that the agent generates a sequence of selected actions and received rewards. We denote the action selected at time \(t\) by \(A_t\). It may be seen as a random variable taking its values in the set \(\mathcal{A}\) of all possible actions. Analogously we denote by \(R_t\) the reward following action \(A_t\), taking its values in a set \(\mathcal{R}\). The subelements of this problem setting are:

  • a policy function \(\pi(.)\),
  • a reward signal \(R_t\),
  • a value function \(q_{*}(.)\),
  • a model of the environment.

These subelements are made precise for the following reinforcement learning situations, beginning with the simplest one, the so-called \(k\)-armed bandit, and continuing with learning situations where the interaction between the agent and its environment can be described by a finite Markov decision process.

The \(k\)-armed bandit

Consider the following learning problem. You are faced repeatedly with a choice among \(k\) different actions. After each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, say over 1000 action selections. This is the so-called \(k\)-armed bandit problem, so named by analogy to a slot machine, a "one-armed bandit", except that it has \(k\) levers instead of one. Each action selection is like a play of one of the slot machine's levers.

Elements of the setting

Each of the \(k\) actions has an expected reward given that that action is selected. Consider an action \(a \in \mathcal{A}\); we define the value of that action as being that expected reward: \[ q_{*}(a):=\mathbf{E}[R_t\,|\,A_t=a] \]

Construction of a 4-armed bandit

In order to construct our multi-armed bandit, we select the true value \(q_{*}(a)\) of each of the \(k=4\) actions according to a normal distribution \(\mathcal{N}(0,1)\). The rewards are then selected according to \(\mathcal{N}(q_{*}(a),1)\).

q_star <- rnorm(4,0,1)
# the true values of the 4 actions, the i-th component corresponding to the i-th action
q_star
## [1] -1.4567730 -0.6997057  1.5237610  1.3729554
r <- sapply(q_star,function(x){rnorm(1,x,1)})
# the rewards
r
## [1] -1.08574576 -0.17380556  0.06385564  0.74939099

At a time step \(t\), the agent does not know the vectors \(q_{\mbox{star}}\) and \(r\). It chooses an action among the four possible actions and receives a reward for it. That reward corresponds to one of the four components of the vector \(r\), say the third one for the third action. That piece of information helps the agent to refine its estimate of the true value of that action, i.e. of the corresponding (here the third) component of the vector \(q_{\mbox{star}}\).

If you knew the value \(q_{*}(a)\) of each action, then you would solve the \(k\)-armed bandit problem by always selecting the action with the highest value. We assume that you do not know the action values but may have estimates of them. We denote the estimated value of action \(a\) at time step \(t\) by \(Q_t(a)\). We may calculate it as the average of the rewards already received, from the first time step 1 up to time step \(t-1\):

\[ Q_t(a):=\frac{\sum_{i=1}^{t-1}R_i\cdot\mathbf{1}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbf{1}_{A_i=a}} \]

which can be seen as the following weighted sum:

\[ Q_t(a):=\sum_{i=1}^{t-1}\frac{\mathbf{1}_{A_i=a}}{\sum_{j=1}^{t-1}\mathbf{1}_{A_j=a}}\cdot R_i \]
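As a direct illustration of this formula, here is a minimal sketch; the vectors actions and rewards below are hypothetical sample data, not part of the bandit constructed above.

# hypothetical history of selected actions and received rewards up to step t-1
actions <- c(1, 3, 3, 2, 3, 1)
rewards <- c(0.2, 1.1, 0.9, -0.3, 1.4, 0.1)
# sample-average estimate Q_t(a) for each of the k = 4 actions
Q_t <- sapply(1:4, function(a) {
  taken <- actions == a
  if (any(taken)) mean(rewards[taken]) else 0  # default estimate 0 if a was never selected
})
Q_t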

The greedy action selection method can be defined by:

\[ A_t:=\mbox{argmax}_a Q_t(a). \]

Greedy action selection always exploits current knowledge, i.e. the current estimates of the action values, in order to maximize the next reward. The problem with that selection method is that some actions might never be selected, even though they could have higher values, simply because the agent does not spend any time exploring its environment.

Implementation: an \(\varepsilon\)-greedy algorithm

We now implement a sequence \(A_0,R_0,A_1,R_1,\ldots,A_T,R_T\) generated by an agent selecting its actions repeatedly over \(T\) time steps.

The obvious implementation would be to use the above formula for the estimated action value, but that would mean keeping a record of all the rewards \(R_i\) and recomputing their average at every step, so that memory and computation would grow with the number of steps. Instead, we reformulate the above formula so that its implementation only requires storing, for each \(a \in \mathcal{A}\), the current action-value estimate and the number of times \(n(a):=\sum_{i=1}^{t-1}\mathbf{1}_{A_i=a}\) the action \(a\) has been selected, and only a small computation per step:

\[ \begin{eqnarray*} Q_{n(a)+1} &=& \frac{1}{n(a)}\cdot\sum_{i=1}^{n(a)}R_i \\ &=& \frac{1}{n(a)}\cdot\left(R_{n(a)}+\sum_{i=1}^{n(a)-1}R_i\right) \\ &=& \frac{1}{n(a)}\cdot\left(R_{n(a)}+(n(a)-1)\cdot Q_{n(a)}\right) \\ &=& \frac{1}{n(a)}\cdot\left(R_{n(a)}+n(a)\cdot Q_{n(a)}-Q_{n(a)}\right) \\ &=& Q_{n(a)}+\frac{1}{n(a)}\left(R_{n(a)}-Q_{n(a)}\right) \end{eqnarray*} \]

Here, all indices run over the set of the steps associated with the action \(a\) only. For indices running over all steps we define:

\[ Q_{n+1} = \left\{ \begin{array}{l} Q_{n}+\frac{1}{n(a)}\left(R_{n}-Q_{n}\right) \quad \mbox{if the }(n+1)\mbox{-th selected action is action }a\,, \\ Q_{n} \quad \mbox{else.}\\ \end{array} \right. \]
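Before the full implementation below, here is a minimal sketch of this incremental update in isolation (the function name update_action_value is our own choice):

update_action_value <- function(Q, N, a, R) {
# Q: vector of current action-value estimates, N: vector of selection counts,
# a: the action just selected, R: the reward just received
  N[a] <- N[a] + 1
  Q[a] <- Q[a] + (R - Q[a]) / N[a]   # Q_{n(a)+1} = Q_{n(a)} + (R - Q_{n(a)}) / n(a)
  list(Q = Q, N = N)
}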

rdu <- function(n,k) sample(1:k,n,replace=TRUE)  # draws n integers uniformly at random from 1:k
q_star <- rnorm(4,0,1)
# the real values of the 4 actions

Q <- function(n,a,q){
# the estimated action values; note that the reward r of the previous step is read from the global environment
  
# input:  n an integer >=1
#         a the action taken at last time step
#         q the action values at last step (a matrix, see below under "output")
  
# output: the estimated action values at step n.
#         It is a matrix of 2 row vectors.
#         The first vector is the vector of the estimated action values (the value_vector).
#         The second vector gives the number of times action i has been selected (the number_vector).
#         The second vector is needed in the computation of the first vector.

  if(n==1){
    len <- 4
    v <- vector("numeric",length=2*len)
    dim(v)=c(2,len)
    v
    # a priori knowledge: all estimates and counts start at zero
  }
  else{
    q_value <- q[1,]
    q_number <- q[2,]
    len <- length(q_value)
    v <- vector("numeric",length=len)
    for(j in 1:len){
      v[j] <- ifelse(j==a,
             q_number[a]+1,
             q_number[j])
    }
    number_vector <- v
    v <- vector("numeric",length=len)
    for(j in 1:len){
      v[j] <- ifelse(j==a,
             q_value[a]+1/number_vector[a]*(r-q_value[a]),
             q_value[j])
    }
    value_vector <- v
    
    rbind(value_vector,number_vector)
  }
}

Selection of the action: let \(\varepsilon\ll 1\) be a small positive number. Most of the time (with probability \(1-\varepsilon\)) the selected action is the one with the highest estimated action value \(Q_t(a)\), and occasionally (with probability \(\varepsilon\)) the action is selected at random, in order to explore actions that would otherwise never be selected at all.

That is the very concept behind a so-called \(\varepsilon\)-greedy algorithm: most of the time the algorithm exploits the estimates of the action values it has at its disposal, but sometimes it explores other possibilities in order to improve those estimates.

A <- function(n,q,epsilon){
# selection of the action
  
  if(runif(1,0,1)<epsilon || n==1){
    rdu(1,4)
  }else
  {
    u_ <- which.max(q[1,])
    u_
  }
}

The first 3 steps of the \(\varepsilon\)-greedy algorithm:

q <- Q(1,a,q)
cat("Q(",1,",a) : \n")
## Q( 1 ,a) :
q
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
a <- A(1,q,0.1)
cat("A(",1,") : ",a,"\n")
## A( 1 ) :  1
r <- sapply(q_star,function(x){rnorm(1,x,1)})[a]
cat("r : ",r,"\n\n")
## r :  -0.07916642
q <- Q(2,a,q)
cat("Q(",2,",a) : \n")
## Q( 2 ,a) :
q
##                      [,1] [,2] [,3] [,4]
## value_vector  -0.07916642    0    0    0
## number_vector  1.00000000    0    0    0
a <- A(2,q,0.1)
cat("A(",2,") : ",a,"\n")
## A( 2 ) :  2
r <- sapply(q_star,function(x){rnorm(1,x,1)})[a]
cat("r : ",r,"\n\n")
## r :  -0.8565267
q <- Q(3,a,q)
cat("Q(",3,",a) : \n")
## Q( 3 ,a) :
q
##                      [,1]       [,2] [,3] [,4]
## value_vector  -0.07916642 -0.8565267    0    0
## number_vector  1.00000000  1.0000000    0    0
a <- A(3,q,0.1)
cat("A(",3,") : ",a,"\n")
## A( 3 ) :  3
r <- sapply(q_star,function(x){rnorm(1,x,1)})[a]
cat("r : ",r,"\n\n")
## r :  -0.3591498

In order to compare the \(\varepsilon\)-greedy algorithm for different values of \(\varepsilon\), we accumulate the sum of all received rewards, from which the average reward over all steps can be derived as a measure for comparison.

sum_of_rewards <- vector("numeric",2)
cumul_of_sum_of_rewards <- sum_of_rewards
for(n in 1:3){
  q <- Q(n,a,q)
  cat("Q(",n,",a) : \n")
  print(q)
  a <- A(n,q,0.1)
  cat("A(",n,") : ",a,"\n")
  r <- sapply(q_star,function(x){rnorm(1,x,1)})[a]
  cat("r : ",r,"\n\n")
  sum_of_rewards[1] <- sum_of_rewards[1]+r
  sum_of_rewards[2] <- n
  cumul_of_sum_of_rewards <- rbind(cumul_of_sum_of_rewards,sum_of_rewards)
  cat("sum_of_rewards : ",sum_of_rewards,"\n\n")
}
## Q( 1 ,a) : 
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## A( 1 ) :  3 
## r :  0.4575045 
## 
## sum_of_rewards :  0.4575045 1 
## 
## Q( 2 ,a) : 
##               [,1] [,2]      [,3] [,4]
## value_vector     0    0 0.4575045    0
## number_vector    0    0 1.0000000    0
## A( 2 ) :  3 
## r :  -0.4914879 
## 
## sum_of_rewards :  -0.03398336 2 
## 
## Q( 3 ,a) : 
##               [,1] [,2]        [,3] [,4]
## value_vector     0    0 -0.01699168    0
## number_vector    0    0  2.00000000    0
## A( 3 ) :  1 
## r :  -0.9765299 
## 
## sum_of_rewards :  -1.010513 3
cumul_of_sum_of_rewards
##                                [,1] [,2]
## cumul_of_sum_of_rewards  0.00000000    0
## sum_of_rewards           0.45750452    1
## sum_of_rewards          -0.03398336    2
## sum_of_rewards          -1.01051330    3

We now run the algorithm for, say, 1000 steps:

epsilon <- 0.1
sum_of_rewards <- vector("numeric",2)
cumul_of_sum_of_rewards <- sum_of_rewards
for(n in 1:1000){
  q <- Q(n,a,q)
  a <- A(n,q,epsilon)
  r <- sapply(q_star,function(x){rnorm(1,x,1)})[a]
  sum_of_rewards[1] <- sum_of_rewards[1]+r
  sum_of_rewards[2] <- n
  cumul_of_sum_of_rewards <- rbind(cumul_of_sum_of_rewards,sum_of_rewards)
}
cat("Q(",1,",a) : \n")
## Q( 1 ,a) :
print(q)
##                      [,1]       [,2]        [,3]      [,4]
## value_vector    0.3106265 -0.4577236 -0.08260344 -1.058675
## number_vector 926.0000000 22.0000000 25.00000000 26.000000
cat("A(",1,") : ",a,"\n")
## A( 1 ) :  1
cat("r : ",r,"\n\n")
## r :  -1.552423
cat("cumul_of_sum_of_rewards : \n\n")
## cumul_of_sum_of_rewards :
head(cumul_of_sum_of_rewards,10)
##                                [,1] [,2]
## cumul_of_sum_of_rewards  0.00000000    0
## sum_of_rewards          -1.90123715    1
## sum_of_rewards          -0.47492356    2
## sum_of_rewards           0.40334837    3
## sum_of_rewards           0.29411711    4
## sum_of_rewards           0.02788743    5
## sum_of_rewards          -0.92650332    6
## sum_of_rewards          -2.08247569    7
## sum_of_rewards          -1.15436231    8
## sum_of_rewards          -1.00728479    9

The vector of the true action values, which is unknown to the agent, is

q_star
## [1]  0.3395214 -0.3543642  0.1632728 -1.3675473

We now compare the learning rate of the \(\varepsilon\)-greedy algorithm for different values of \(\varepsilon\).
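A self-contained sketch of such a comparison is given below; it re-implements the incremental update instead of reusing the Q() and A() functions above, and the function name run_bandit, the number of runs and the chosen values of \(\varepsilon\) are our own choices.

run_bandit <- function(epsilon, steps = 1000, k = 4) {
  q_star <- rnorm(k, 0, 1)           # true action values, unknown to the agent
  Q <- numeric(k)                    # estimated action values
  N <- numeric(k)                    # selection counts
  total_reward <- 0
  for (t in 1:steps) {
    a <- if (runif(1) < epsilon || t == 1) sample(1:k, 1) else which.max(Q)
    R <- rnorm(1, q_star[a], 1)      # reward drawn from N(q_star(a), 1)
    N[a] <- N[a] + 1
    Q[a] <- Q[a] + (R - Q[a]) / N[a] # incremental sample average
    total_reward <- total_reward + R
  }
  total_reward / steps               # average reward over all steps
}
# average reward (over 50 independent runs) for three values of epsilon
sapply(c(0, 0.01, 0.1), function(eps) mean(replicate(50, run_bandit(eps))))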

Tracking a nonstationary problem

Learning problems are often nonstationary. In such cases it makes sense to give more weight to recent rewards than to long-past ones. To achieve this, we simply adapt the calculation of the action values \(Q_{n+1}\). Instead of defining (for \(n\in\mathbb{N}\))

\[ Q_{n+1} = \left\{ \begin{array}{l} Q_{n}+\frac{1}{n(a)}\left(R_{n}-Q_{n}\right) \quad \mbox{if the }(n+1)\mbox{-th selected action is action }a\,, \\ Q_{n} \quad \mbox{else.}\\ \end{array} \right. \]

that is, a weighted sum where the weights depend on \(n\), we define

\[ Q_{n+1} = \left\{ \begin{array}{l} Q_{n}+\alpha\cdot\left(R_{n}-Q_{n}\right) \quad \mbox{if the }(n+1)\mbox{-th selected action is action }a\,, \\ Q_{n} \quad \mbox{else.}\\ \end{array} \right. \] where the step size \(\alpha\) is constant and does not depend on the time step \(n\). We may verify that the expression \(Q_{n}+\alpha\cdot\left(R_{n}-Q_{n}\right)\) is indeed a weighted average, in the sense that all weights sum to one. To see this, consider:

\[ \begin{eqnarray*} Q_{n+1} &=& Q_{n}+\alpha\cdot\left(R_{n}-Q_{n}\right) \\ &=& \alpha\cdot R_n+(1-\alpha)\cdot Q_n \\ &=& \alpha\cdot R_n+(1-\alpha)\cdot [\alpha\cdot R_{n-1}+(1-\alpha)\cdot Q_{n-1}] \\ &=& \alpha\cdot R_n+(1-\alpha)\cdot \alpha\cdot R_{n-1}+(1-\alpha)^2\cdot\alpha\cdot R_{n-2}+\ldots \\ &=& (1-\alpha)^n\cdot Q_1 +\sum_{i=1}^n \alpha\cdot(1-\alpha)^{n-i}\cdot R_i \\ \end{eqnarray*} \] The weights indeed sum to one, since \((1-\alpha)^n+\sum_{i=1}^n \alpha\cdot(1-\alpha)^{n-i}=(1-\alpha)^n+1-(1-\alpha)^n=1\).
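A minimal sketch of this constant step-size update (the function name and the default value \(\alpha=0.1\) are our own choices):

update_constant_alpha <- function(Q, a, R, alpha = 0.1) {
# exponential recency-weighted average: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
  Q[a] <- Q[a] + alpha * (R - Q[a])
  Q
}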

The gradient bandit algorithm

So far we have considered methods that estimate action values and use those estimates to select actions. Another approach consists in learning numerical preferences for each action \(a\), which we denote \(H_t(a)\). The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another is important.

\[ P(A_t=a):=\frac{\exp(H_t(a))}{\sum_{b=1}^k\exp(H_t(b))}=:\pi(a). \]

It is the Boltzmann distribution, or soft-max distribution.

Initially all action preferences are the same, i.e. \(H_1(a)=0\;\forall a\). There is a learning algorithm based on the idea of gradient ascent:

\[ H_{t+1}(a):=H_t(a)+\alpha\cdot\frac{\partial\,E[R_t]}{\partial\,H_t(a)}\,, \]

where

\[ E[R_t]=\sum_x \pi_t(x)\cdot q_{*}(x). \]

We cannot implement this gradient ascent exactly because, by assumption, we do not know the true action values \(q_{*}(x)\). However, the stochastic increment and the exact gradient-ascent increment

\[ \alpha\cdot(R_t-\overline{R}_t)\cdot\left(1_{\{a=A_t\}}-\pi_t(a)\right) \quad\mbox{and}\quad \alpha\cdot\frac{\partial\,E[R_t]}{\partial\,H_t(a)} \]

are equal in expected value. Here \(\overline{R}_t\) denotes the average of the rewards received up to time \(t\); it will serve as a baseline. To see the equality:

\[ \begin{eqnarray*} \frac{\partial\,E[R_t]}{\partial\,H_t(a)} &=& \frac{\partial}{\partial\,H_t(a)}\left[\sum_x \pi_t(x)\cdot q_*(x) \right] \\ &=& \sum_x q_*(x)\cdot \frac{\partial\,\pi_t(x)}{\partial\,H_t(a)} \\ &=& \sum_x (q_*(x)-B_t)\cdot \frac{\partial\,\pi_t(x)}{\partial\,H_t(a)} \\ && \mbox{where }B_t\mbox{ is the baseline.}\\ && \mbox{It can be any number not depending on }x. \\ && \mbox{Indeed } \sum_x \frac{\partial\,\pi_t(x)}{\partial\,H_t(a)} = \frac{\partial\,\overbrace{\sum_x \pi_t(x)}^{\mbox{it sums to 1}}}{\partial\,H_t(a)} = \frac{\partial\,1}{\partial\,H_t(a)} = 0 \\ &=& \sum_x \pi_t(x)\cdot(q_*(x)-B_t)\cdot \frac{\partial\,\pi_t(x)}{\partial\,H_t(a)}\bigg/\pi_t(x) \quad\mbox{(we multiply by }\frac{\pi_t(x)}{\pi_t(x)}=1)\\ \end{eqnarray*} \]

This expression has now the form of an expectation:

\[ \begin{eqnarray*} && \frac{\partial\,E[R_t]}{\partial\,H_t(a)} \\ &=& \sum_x \pi_t(x)\cdot(q_*(x)-B_t)\cdot \frac{\partial\,\pi_t(x)}{\partial\,H_t(a)}\bigg/\pi_t(x) \\ &=& E\left[(q_*(A_t)-B_t)\cdot \frac{\partial\,\pi_t(A_t)}{\partial\,H_t(a)}\bigg/\pi_t(A_t)\right] \end{eqnarray*} \]

By setting \(B_t=\overline{R}_t\) (by choice) and substituting \(R_t\) for \(q_*(A_t)\) because of the relation \(E[R_t\,|\,A_t]=q_*(A_t)\), we get:

\[ \begin{eqnarray*} && \frac{\partial\,E[R_t]}{\partial\,H_t(a)} \\ &=& E\left[(R_t-\overline{R}_t)\cdot \frac{\partial\,\pi_t(A_t)}{\partial\,H_t(a)}\bigg/\pi_t(A_t)\right]. \\ \end{eqnarray*} \]

Moreover, for any action \(x\), we have

\[ \begin{eqnarray*} \frac{\partial\,\pi_t(x)}{\partial\,H_t(a)} &=& \frac{\partial}{\partial\,H_t(a)} \left[\frac{\exp(H_t(x))}{\sum_{y=1}^k\exp(H_t(y))}\right] \\ &=& \frac{\frac{\partial\,\exp(H_t(x))}{\partial\,H_t(a)} \cdot\sum_y\exp(H_t(y)) -\exp(H_t(x))\cdot \frac{\partial\,\sum_y\exp(H_t(y))}{\partial\,H_t(a)}} {[\sum_y\exp(H_t(y))]^2} \\ &=& \frac{1_{\{a=x\}}\cdot \exp(H_t(x)) \cdot\sum_y\exp(H_t(y)) -\exp(H_t(x))\cdot \exp(H_t(a))} {[\sum_y\exp(H_t(y))]^2} \\ && \mbox{because }\frac{\partial e^x}{\partial x}=e^x \\ &=& \frac{1_{\{a=x\}}\cdot \exp(H_t(x))}{\sum_y\exp(H_t(y))} -\frac{\exp(H_t(x))\cdot \exp(H_t(a))}{[\sum_y\exp(H_t(y))]^2} \\ &=& 1_{\{a=x\}}\cdot\pi_t(x)-\pi_t(x)\cdot\pi_t(a) \\ &=& \pi_t(x)\cdot\left(1_{\{a=x\}}-\pi_t(a)\right) \\ \end{eqnarray*} \] Using that piece of information (with \(x=A_t\)), we get \[ \begin{eqnarray*} && \frac{\partial\,E[R_t]}{\partial\,H_t(a)} \\ &=& E\left[(R_t-\overline{R}_t)\cdot \frac{\partial\,\pi_t(A_t)}{\partial\,H_t(a)}\bigg/\pi_t(A_t)\right] \\ &=& E\left[(R_t-\overline{R}_t)\cdot\left(1_{\{a=A_t\}}-\pi_t(a)\right)\right]. \\ \end{eqnarray*} \]

Recall that we want to write the performance gradient as an expectation of something that we can sample on each step. Substituting a sample of the expectation yields:

\[ H_{t+1}(a)=H_t(a)+\alpha\cdot(R_t-\overline{R}_t)\cdot\left(1_{\{a=A_t\}}-\pi_t(a)\right)\quad\forall a. \] That’s why our learning algorithm is:

\[ \left\{ \begin{array}{llll} H_{t+1}(a) &:=& H_t(a)-\alpha\cdot(R_t-\overline{R}_t)\cdot\pi_t(a) & \forall a\neq A_t\,, \\ H_{t+1}(A_t) &:=& H_t(A_t)+\alpha\cdot(R_t-\overline{R}_t)\cdot\left(1-\pi_t(A_t)\right) & \mbox{for }A_t\mbox{ the action selected at time step }t. \\ \end{array} \right. \]

As we calculated above, this method really is a gradient ascent method. This assures us that the algorithm has robust convergence properties.
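A minimal, self-contained sketch of this gradient bandit algorithm on the 4-armed testbed used above (the function name gradient_bandit, the step size \(\alpha=0.1\) and the horizon are our own choices):

gradient_bandit <- function(steps = 1000, k = 4, alpha = 0.1) {
  q_star <- rnorm(k, 0, 1)           # true action values, unknown to the agent
  H <- numeric(k)                    # action preferences, H_1(a) = 0 for all a
  mean_reward <- 0                   # baseline: running average of the rewards
  for (t in 1:steps) {
    pi_t <- exp(H) / sum(exp(H))     # soft-max distribution over actions
    a <- sample(1:k, 1, prob = pi_t)
    R <- rnorm(1, q_star[a], 1)
    mean_reward <- mean_reward + (R - mean_reward) / t
    # preference update: minus alpha*(R - baseline)*pi for all actions,
    # plus an extra alpha*(R - baseline) for the selected action
    H <- H - alpha * (R - mean_reward) * pi_t
    H[a] <- H[a] + alpha * (R - mean_reward)
  }
  list(preferences = H, soft_max = exp(H) / sum(exp(H)), q_star = q_star)
}
gradient_bandit()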

Finite Markov Decision Processes (MDP)

Markovian formalism

We consider discrete time steps \(t=0,1,2,\ldots\). At each step, the agent receives some representation of the environment’s state \(S_t\in\mathcal{S}\), and on that basis selects an action \(A_t\in\mathcal{A}(S_t)\). One time step later, the agent receives, in part as a consequence of its action, a numerical reward \(R_{t+1}\in\mathcal{R}\), and finds itself in a new state \(S_{t+1}\).

Note: We use \(R_{t+1}\) instead of \(R_t\) to denote the reward due to \(A_t\), because it emphasizes that the next reward and next state, \(R_{t+1}\) and \(S_{t+1}\), are jointly determined.

The interaction between the agent and its environment can thus be described by an MDP. It gives rise to a sequence, or trajectory, that begins with

\[ S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,\ldots \] In a finite MDP, all three sets \(\mathcal{S},\mathcal{A}\) and \(\mathcal{R}\) have a finite number of elements. In this case, the random variables \(R_t\) and \(S_t\) have well-defined discrete probability distributions that depend only on the preceding state and action.

The probability that \(s'\in\mathcal{S}\) and \(r\in\mathcal{R}\) occur at time \(t\), given particular values of the preceding state and action, is:

\[ p(s',r \,|\, s,a):=P(S_t=s',R_t=r \,|\, S_{t-1}=s,A_{t-1}=a), \\ \forall s',s\in\mathcal{S} \mbox{ and } r\in\mathcal{R}\mbox{ and } a\in\mathcal{A}(s). \]

That function defines the dynamics of the MDP, characterizing it completely. The MDP describes the possible interactions between the agent and its environment. Of course we have that

\[ \sum_{s'}\sum_r p(s',r \,|\, s,a)=1\quad\forall s\in\mathcal{S}\mbox{ and }\forall a\in\mathcal{A}(s). \]

The state-transition probabilities:

\[ p(s' \,|\, s,a):=P(S_t=s' \,|\, S_{t-1}=s,A_{t-1}=a)=\sum_{r\in\mathcal{R}}p(s',r \,|\, s,a). \]

The expected rewards for state-action pairs:

\[ r(s,a):=E[R_t \,|\, S_{t-1}=s,A_{t-1}=a]=\sum_{r\in\mathcal{R}}r\sum_{s'\in\mathcal{S}}p(s',r \,|\, s,a). \]

The expected rewards for state-action-next state triples:

\[ r(s,a,s'):=E[R_t \,|\, S_{t-1}=s,A_{t-1}=a,S_t=s']=\sum_{r\in\mathcal{R}}r\,\frac{p(s',r \,|\, s,a)}{p(s' \,|\, s,a)}. \]
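These marginalisations are easy to carry out numerically once the dynamics \(p(s',r\,|\,s,a)\) are stored as an array. Here is a minimal sketch with a hypothetical two-state, two-action MDP with rewards 0 and 1 (the array p below is made up purely for illustration):

# hypothetical dynamics p[s', r, s, a] for 2 states, 2 possible rewards (0, 1), 2 actions
p <- array(0, dim = c(2, 2, 2, 2),
           dimnames = list(s_next = c("s1","s2"), r = c("0","1"),
                           s = c("s1","s2"), a = c("a1","a2")))
p["s1","1","s1","a1"] <- 0.7 ; p["s2","0","s1","a1"] <- 0.3
p["s2","1","s1","a2"] <- 0.5 ; p["s1","0","s1","a2"] <- 0.5
p["s1","0","s2","a1"] <- 1.0
p["s2","1","s2","a2"] <- 0.4 ; p["s1","0","s2","a2"] <- 0.6
rewards <- c(0, 1)

# state-transition probabilities p(s' | s, a): sum over r
p_s <- apply(p, c(1, 3, 4), sum)
# expected rewards r(s, a): sum over s' and r of r * p(s', r | s, a)
r_sa <- apply(p, c(3, 4), function(m) sum(rewards * colSums(m)))
p_s ; r_sa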

Now have a look at the next section, where the example of a recycling robot is worked out in some detail, then come back and continue reading.

Episodic and continuing tasks

The agent’s goal is to maximize the cumulative reward it receives in the long run. We seek to maximize the expected return

\[ G_t:=R_{t+1}+R_{t+2}+\ldots+R_T\,, \]

where \(T\) is the final step. This approach makes sense in applications where there is a natural notion of a final time step, that is, when the agent-environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game or trips through a maze. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state. Even if episodes end in different ways, such as winning or losing a game, the next episode begins independently of how the previous one ended. And that is the point. Tasks with episodes are called episodic tasks. In episodic tasks we need to distinguish the set of all non-terminal states, denoted \(\mathcal{S}\), from the set of all states including the terminal states, denoted \(\mathcal{S}^+\). The time of termination \(T\) is a random variable that normally varies from episode to episode.

On the other hand, in many cases the agent-environment interaction does not break into identifiable episodes, but goes on continually without limit. We speak of continuing tasks. The formulation of the expected return given above is then problematic because the final step could be \(\infty\). That’s why we use discounting:

\[ G_t:=R_{t+1}+\gamma\cdot R_{t+2}+\gamma^2\cdot R_{t+3}+\ldots =\sum_{k=0}^\infty \gamma^k\cdot R_{t+k+1}\,, \]

where \(0<\gamma<1\) is the discount rate, which determines the present value of future rewards. The return can also be written recursively:

\[ \begin{eqnarray*} G_t &:=& R_{t+1}+\gamma\cdot(R_{t+2}+\gamma\cdot R_{t+3}+\ldots) \\ &=& R_{t+1}+\gamma\cdot G_{t+1} \end{eqnarray*} \]

That recursive relation holds \(\forall t<T\), even if termination occurs at \(t+1\), provided we define \(G_T:=0\).
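A minimal sketch of this recursion, computing the returns \(G_t\) backwards from a hypothetical finite reward sequence (the function name discounted_returns is our own choice):

discounted_returns <- function(rewards, gamma = 0.9) {
# rewards[t] is interpreted as R_{t+1}; by convention G_T := 0
  G <- numeric(length(rewards))
  G_next <- 0
  for (t in length(rewards):1) {
    G[t] <- rewards[t] + gamma * G_next   # G_t = R_{t+1} + gamma * G_{t+1}
    G_next <- G[t]
  }
  G
}
discounted_returns(c(1, 0, 0, 2))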

A unified notation

For episodic tasks: rather than one long sequence of time steps, we need to consider a series of episodes, each of which consists of a finite sequence of time steps. We number the steps of each episode starting anew from zero. Therefore, we have to refer not just to \(S_t\) but to \(S_{t,i}\), the state representation at time \(t\) of episode \(i\) (and similarly for \(A_{t,i},\pi_{t,i},\ldots\)). To obtain a single notation that covers both episodic and continuing tasks, we consider episode termination to be the entering of a special absorbing state that transitions only to itself and generates only rewards of zero.

With this convention, we can write the return as defined above, including the possibility that \(T=\infty\) or \(\gamma=1\) (no discounting), but not both.

Policy and value function for a given policy

A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy \(\pi\) at time \(t\), then \(\pi(a|s)\) is the probability that \(A_t=a\) if \(S_t=s\).

The value function of state \(s\) under a policy \(\pi\) is the expected return when starting in \(s\) and following \(\pi\) thereafter. For MDP’s, we can define:

\[ v_\pi(s):=E_\pi[G_t \,|\, S_t=s]=E_\pi\left[\sum_{k=0}^\infty \gamma^k\cdot R_{t+k+1} \,\bigg|\,S_t=s\right]\quad\forall s\in\mathcal{S}. \]

Here \(E_\pi[\cdot]\) denotes the expected value of a random variable given that the agent follows policy \(\pi\), and \(t\) can be any time step.

The value of taking action \(a\) in state \(s\) under a policy \(\pi\) is defined as:

\[ q_\pi(s,a):=E_\pi[G_t \,|\, S_t=s,A_t=a]=E_\pi\left[\sum_{k=0}^\infty\gamma^k\cdot R_{t+k+1} \,\bigg|\,S_t=s,A_t=a\right]. \]

It is the action value function for policy \(\pi\).

The Bellman equation

A fundamental property of value functions, used throughout Reinforcement Learning, is that they satisfy recursive relationships. For any policy \(\pi\) and any state \(s\), the following consistency condition holds between the value of \(s\) and the values of its possible successor states:

\[ \begin{eqnarray*} v_\pi(s) &:=& E_\pi[G_t \,|\, S_t=s] \\ &=& E_\pi[R_{t+1}+\gamma\cdot G_{t+1} \,|\, S_t=s] \\ &=& \sum_a \pi(a|s)\cdot\sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot E_\pi[G_{t+1} \,|\, S_{t+1}=s']] \\ &=& \sum_a \pi(a|s)\cdot\sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot v_\pi(s')] \quad,\forall s\in\mathcal{S} \\ \end{eqnarray*} \]

where it is implicit that the actions \(a\) are taken from the set \(\mathcal{A}(s)\), that the next states \(s'\) are taken from the set \(\mathcal{S}\) (or some \(\mathcal{S}^+\) in the case of an episodic problem), and that the rewards \(r\) are taken from the set \(\mathcal{R}\). Note how the final expression can be read easily as an expected value.
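Since this consistency condition is linear in the values \(v_\pi(s)\), it can, for a small finite MDP, also be solved directly rather than iteratively. A minimal sketch for a hypothetical two-state problem (the matrix P_pi of state-transition probabilities under \(\pi\) and the vector r_pi of expected one-step rewards are made up for illustration):

gamma <- 0.9
P_pi <- matrix(c(0.8, 0.2,
                 0.4, 0.6), nrow = 2, byrow = TRUE)  # P_pi[s, s'] under policy pi
r_pi <- c(1, -0.5)                                   # expected one-step rewards under pi
# v_pi = r_pi + gamma * P_pi %*% v_pi  <=>  (I - gamma * P_pi) v_pi = r_pi
v_pi <- solve(diag(2) - gamma * P_pi, r_pi)
v_pi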

Optimal policies and optimal value functions

For finite MDP’s, we can define exactly an optimal policy in the following way.

A policy \(\pi\) can be defined to be better than or equal to a policy \(\pi'\), a relationship that we write \(\pi\geq\pi'\). We define: \[ \pi\geq\pi' \quad\Longleftrightarrow\quad v_{\pi}(s)\geq v_{\pi'}(s) \quad,\forall s\in\mathcal{S}. \]

There always exists at least one policy that is better than or equal to all other policies: an optimal policy. Although there may be more than one, we denote all the optimal policies by \(\pi_*\). They share the same state-value function

\[ v_*(s):=\max_\pi v_\pi(s) \quad,\forall s\in\mathcal{S}. \] Optimal policies also share the same optimal action-value function: \[ q_*(s,a):=\max_\pi q_\pi(s,a) \quad,\forall s\in\mathcal{S} \mbox{ and }\forall a\in\mathcal{A}(s). \]

For the state-action pair \((s,a)\), \(q_*\) gives the expected return for taking action \(a\) in state \(s\) and thereafter following an optimal policy.

We have:

\[ q_*(s,a):=E[R_{t+1}+\gamma\cdot v_*(S_{t+1}) \,|\, S_t=s,A_t=a]. \]

Because \(v_*\) is the value function for a policy (an optimal policy), it must satisfy the self-consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, \(v_*\)’s consistency condition can be written in a special form without reference to any specific policy. That Bellman equation for \(v_*\) is called the Bellman optimality condition. It expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

\[\begin{eqnarray*} v_*(s) &=& \max_{a\in\mathcal{A}(s)} q_{\pi_*}(s,a) \\ &=& \max_{a\in\mathcal{A}(s)} E_{\pi_*}[G_{t} \,|\, S_t=s,A_t=a] \\ &=& \max_{a\in\mathcal{A}(s)} E_{\pi_*}[R_{t+1}+\gamma\cdot G_{t+1} \,|\, S_t=s,A_t=a] \\ &=& \max_{a\in\mathcal{A}(s)} E_{\pi_*}[R_{t+1}+\gamma\cdot v_*(S_{t+1}) \,|\, S_t=s,A_t=a] \\ &=& \max_{a\in\mathcal{A}(s)} \sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot v_*(s')] \\ \end{eqnarray*}\]

This is the Bellman optimality condition for \(v_*\). For \(q_*\) it is:

\[ \begin{eqnarray*} q_*(s,a) &=& E[R_{t+1}+\gamma\cdot \max_{a'\in\mathcal{A}(S_{t+1})} q_*(S_{t+1},a') \,|\, S_t=s,A_t=a] \\ &=& \sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot \max_{a'\in\mathcal{A}(s')}q_*(s',a')] \\ \end{eqnarray*} \]

The recycling robot:
The set up

The robot has two battery charge levels: \(\mathcal{S}=\{high,low\}\).
The robot can keep on searching for more garbage, or it can wait as a static garbage collector, or it can go back to its dock to recharge its battery: \(\mathcal{A}=\{search, wait, load\}\).

The reward is zero most of the time, but becomes positive when the robot secures an empty can, or large and negative if the battery runs all the way down. The best way to find cans is to actively search for them, but this runs down the battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued.

The corresponding Markov chain can be given by a graph and thus by a data set specifying each edge of that graph. The first column of the data set below corresponds to the “from node”, the second column to the “to node”, the third column corresponds to the transition probability. Each row stands for an edge of the graph.

MDPs are special Markov chains in the sense that one needs to distinguish between state nodes (here the two states \(h\) for “the charge level of the battery is high” and \(l\) for “the charge level of the battery is low”) and state-action nodes (for example the pair \((search, high)\), meaning “given that the charge level of the battery is high, the agent decides to search for new cans”; other pairs are for example \((search, low)\) or \((wait,low)\)).

Any edge is given together with a transition probability. For a path from a state node \(s\) to a state node \(s'\) through the state-action node \((s,a)\), the probability factorizes as \(p(s'|a,s)\cdot p(a|s)\), so that \(p(s'|s)=\sum_a p(s'|a,s)\cdot p(a|s)\). The probability \(p(a|s)\) describes the agent’s decision and does not belong to the description of the environment itself. In order to use the markovchain package, we choose values for the \(p(a|s)\), paying attention to the sum-to-one constraint:

library("markovchain")
library("expm")

alpha <- 0.3
beta <- 0.6
df <- vector("list",length=3)
df <- as.data.frame(df)
df <- rbind(list("sear_h","h",alpha),df)
df <- rbind(list("sear_h","l",1-alpha),df)
df <- rbind(list("wait_h","h",1),df)
df <- rbind(list("sear_l","h",1-beta),df)
df <- rbind(list("sear_l","l",beta),df)
df <- rbind(list("wait_l","l",1),df)
df <- rbind(list("load_l","h",1),df)
df <- rbind(list("l","load_l",1/3),df)
df <- rbind(list("l","wait_l",1/3),df)
df <- rbind(list("l","sear_l",1/3),df)
df <- rbind(list("h","wait_h",1/2),df)
df <- rbind(list("h","sear_h",1/2),df)


colnames(df) <- c("t0","t1","prob")
classes <- sapply(df,class)
classes
##          t0          t1        prob 
## "character" "character"   "numeric"
colnames <- c(1:2)
# Below: in order to have only one set of levels for all columns
# instead of one different set of levels for each column, we use levels=unique(unlist(...))
df[,colnames] <- lapply(df[,colnames],factor, levels = unique(unlist(df[,colnames])))
classes <- sapply(df,class)
classes
##        t0        t1      prob 
##  "factor"  "factor" "numeric"
df
##        t0     t1      prob
## 1       h sear_h 0.5000000
## 2       h wait_h 0.5000000
## 3       l sear_l 0.3333333
## 4       l wait_l 0.3333333
## 5       l load_l 0.3333333
## 6  load_l      h 1.0000000
## 7  wait_l      l 1.0000000
## 8  sear_l      l 0.6000000
## 9  sear_l      h 0.4000000
## 10 wait_h      h 1.0000000
## 11 sear_h      l 0.7000000
## 12 sear_h      h 0.3000000

We next define an R object that represents the Markov chain.

mcf <- as(df,"markovchain")
mcf
## Unnamed Markov chain 
##  A  7 - dimensional discrete Markov Chain defined by the following states: 
##  h, l, load_l, wait_l, sear_l, wait_h, sear_h 
##  The transition matrix  (by rows)  is defined as follows: 
##          h   l    load_l    wait_l    sear_l wait_h sear_h
## h      0.0 0.0 0.0000000 0.0000000 0.0000000    0.5    0.5
## l      0.0 0.0 0.3333333 0.3333333 0.3333333    0.0    0.0
## load_l 1.0 0.0 0.0000000 0.0000000 0.0000000    0.0    0.0
## wait_l 0.0 1.0 0.0000000 0.0000000 0.0000000    0.0    0.0
## sear_l 0.4 0.6 0.0000000 0.0000000 0.0000000    0.0    0.0
## wait_h 1.0 0.0 0.0000000 0.0000000 0.0000000    0.0    0.0
## sear_h 0.3 0.7 0.0000000 0.0000000 0.0000000    0.0    0.0
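As a quick numerical check of the factorisation \(p(s'|s)=\sum_a p(s'|a,s)\cdot p(a|s)\) mentioned above, we can compare it with a two-step transition of the bipartite chain (state node to action node to state node), using the transitionMatrix slot of the markovchain object; this snippet is only a side check and reuses the value alpha = 0.3 set above.

P <- mcf@transitionMatrix   # the 7 x 7 transition matrix shown above (rows = "from" nodes)
P2 <- P %*% P               # two steps: state node -> action node -> state node
P2["h", "l"]                # probability of reaching state l from state h
# by hand: p(l|h) = p(l|sear_h,h) * p(sear_h|h) + p(l|wait_h,h) * p(wait_h|h)
(1 - alpha) * 1/2 + 0 * 1/2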

The abstract Markov chain is illustrated by the following table. Notice that alpha and beta are not given numerical values there. Notice also that the variables \(aa\), \(bb\) and \(cc\), which represent the decisions the agent has to make, are not valued either. alpha and beta are parameters of the environment itself, whereas \(aa\), \(bb\) and \(cc\) define what we call the policy.

The corresponding graph:

The transition graph shows two kinds of nodes: state nodes (h, l) and action nodes (sear_h, sear_l, wait_h, wait_l, load_l): a state node for each state, and an action node for each pair (state, action). Starting in state \(s\) and taking action \(a\) moves you along the edge from state node \(s\) to action node \((s,a)\). The environment then responds with a transition to the next state node \(s'\) via one of the arrows leaving action node \((s,a)\).

The recycling robot:
Evaluation of a given policy

Dynamic Programming

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Moreover we assume that the state, action, and reward sets \(\mathcal{S}\), \(\mathcal{A}\), and \(\mathcal{R}\), are finite, and that the environment’s dynamics are given by a set of probabilities \(p(s',r\,|\,s,a)\), \(\forall s\in\mathcal{S}, a\in\mathcal{A}(s)\), \(r\in\mathcal{R}\), and \(s'\in\mathcal{S}^+\) (\(\mathcal{S}^+\) is \(\mathcal{S}\) plus a terminal state if the problem is episodic).

Policy evaluation

How to compute the state-value function \(v_\pi\) for an arbitrary policy \(\pi\)?

Recall that

\[\begin{eqnarray*} v_\pi(s) &:=& E_\pi[G_t \,|\, S_t=s] \\ &=& \sum_a \pi(a|s)\cdot\sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot v_\pi(s')] \quad,\forall s\in\mathcal{S} \\ \end{eqnarray*}\]

The state-value function \(v_\pi\) for the given policy \(\pi\) can thus be obtained by the following fixed-point iteration:

\[\begin{eqnarray*} v_{k+1}(s) &:=& \sum_a \pi(a|s)\cdot\sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot v_k(s')] \quad,\forall s\in\mathcal{S} \\ \end{eqnarray*}\]

Each iteration of the algorithm updates the value of every state in order to produce the new approximate value function \(v_{k+1}\). To write a sequential program implementing iterative policy evaluation, one could use two arrays, one for the old values \(v_k(s)\) and one for the new values \(v_{k+1}(s)\). With two arrays, the new values can be computed one by one from the old values without the old values being changed. Alternatively, one can use a single array and update the values “in place”, that is, with each new value immediately overwriting the old one. It can be shown that the in-place method usually converges faster than the two-array version.
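A minimal sketch of this fixed-point iteration, in the two-array form and for the same hypothetical two-state problem (P_pi, r_pi) used in the earlier sketch of the Bellman equation:

gamma <- 0.9
P_pi <- matrix(c(0.8, 0.2,
                 0.4, 0.6), nrow = 2, byrow = TRUE)   # hypothetical P_pi[s, s'] under pi
r_pi <- c(1, -0.5)                                    # hypothetical expected one-step rewards
v <- c(0, 0)                                          # v_0, initialised arbitrarily
repeat {
  v_new <- r_pi + gamma * as.vector(P_pi %*% v)       # one full sweep: v_{k+1} from v_k
  if (max(abs(v_new - v)) < 1e-8) break
  v <- v_new
}
v_new   # converges to the exact solution obtained by solving the linear system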

Recall the MDP that describes the environment of the recycling robot:

Notice again that alpha and beta are not given numerical values there, and neither are the variables \(aa\), \(bb\) and \(cc\), which represent the policy, i.e. the decisions the agent has to make.

To compute anything, one has to give values to \(aa\), \(bb\) and \(cc\), i.e. to fix a policy, as well as to alpha and beta, which are the parameters of our environment model (the MDP). We then obtain an R object, which is a Markov chain:

show(mcf)
## Unnamed Markov chain 
##  A  7 - dimensional discrete Markov Chain defined by the following states: 
##  h, l, load_l, wait_l, sear_l, wait_h, sear_h 
##  The transition matrix  (by rows)  is defined as follows: 
##          h   l    load_l    wait_l    sear_l wait_h sear_h
## h      0.0 0.0 0.0000000 0.0000000 0.0000000    0.5    0.5
## l      0.0 0.0 0.3333333 0.3333333 0.3333333    0.0    0.0
## load_l 1.0 0.0 0.0000000 0.0000000 0.0000000    0.0    0.0
## wait_l 0.0 1.0 0.0000000 0.0000000 0.0000000    0.0    0.0
## sear_l 0.4 0.6 0.0000000 0.0000000 0.0000000    0.0    0.0
## wait_h 1.0 0.0 0.0000000 0.0000000 0.0000000    0.0    0.0
## sear_h 0.3 0.7 0.0000000 0.0000000 0.0000000    0.0    0.0

In this example, the probabilities for the decisions of the agent \(aa\), \(bb\) and \(cc\) were set equal to 0.5, 0.333 and 0.333 respectively, which corresponds to a uniform policy (each choice made with equal probability), and the probabilities alpha and beta were set equal to 0.3 and 0.4 respectively.

The equation

\[\begin{eqnarray*} v_{k+1}(s) &:=& \sum_a \pi(a|s)\cdot\sum_{s'}\sum_r p(s',r \,|\, s,a)\cdot[r+\gamma\cdot v_k(s')] \quad,\forall s\in\mathcal{S}, \\ \end{eqnarray*}\]

when applied to the MDP of the recycling robot, becomes

\[ \begin{array}{lllll} v_{k+1}(s) &:=& \pi(load,s)& \cdot\bigg( \\ &&& p(h,r|load,s)\cdot[r+\gamma\cdot v_k(h)] \\ &&+& p(h,0|load,s)\cdot[0+\gamma\cdot v_k(h)] \\ &&+& p(l,r|load,s)\cdot[r+\gamma\cdot v_k(l)] \\ &&+& p(l,0|load,s)\cdot[0+\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(wait,s)& \cdot\bigg( \\ &&& p(h,r|wait,s)\cdot[r+\gamma\cdot v_k(h)] \\ &&+& p(h,0|wait,s)\cdot[0+\gamma\cdot v_k(h)] \\ &&+& p(l,r|wait,s)\cdot[r+\gamma\cdot v_k(l)] \\ &&+& p(l,0|wait,s)\cdot[0+\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(search,s)& \cdot\bigg( \\ &&& p(h,r|search,s)\cdot[r+\gamma\cdot v_k(h)] \\ &&+& p(h,0|search,s)\cdot[0+\gamma\cdot v_k(h)] \\ &&+& p(l,r|search,s)\cdot[r+\gamma\cdot v_k(l)] \\ &&+& p(l,0|search,s)\cdot[0+\gamma\cdot v_k(l)] \bigg)\,, \\ \forall s\in\{h,l\}. \\ \end{array} \]

We set a value for the reward \(r\), say \(r=1\). We decompose each factor \(p(h,r|a,s)\):

\[ \begin{eqnarray*} p(h,r|a,s) &=& p(h|a,s,r)\cdot p(r|a,s) \\ &=& p(h|a,s)\cdot p(r|a,s) \,, \end{eqnarray*} \] where the second equality assumes that the next state and the reward are conditionally independent given the state-action pair. Specifically for action \(a=load\), we set (notice that the state-action node \((load,h)\) does not exist):

\[\begin{array}{lll ll} p(h,r|load,h) &=& p(h|load,h)\cdot p(r|load,h) &=& 0 \\ p(h,r|load,l) &=& p(h|load,l)\cdot p(r|load,l) &=& p(h|load,l)\cdot 0 \\ p(h,0|load,h) &=& p(h|load,h)\cdot p(0|load,h) &=& 0 \\ p(h,0|load,l) &=& p(h|load,l)\cdot p(0|load,l) &=& p(h|load,l)\cdot 1 \\ \\ p(l,r|load,h) &=& p(l|load,h)\cdot p(r|load,h) &=& 0 \\ p(l,r|load,l) &=& p(l|load,l)\cdot p(r|load,l) &=& 0 \\ p(l,0|load,h) &=& p(l|load,h)\cdot p(0|load,h) &=& 0 \\ p(l,0|load,l) &=& p(l|load,l)\cdot p(0|load,l) &=& 0 \\ \\ \end{array}\]

Analogously (the \(\rho\) are probabilities of getting reward \(r\)), we set, for action \(a=wait\):

\[\begin{array}{lll ll} p(h,r|wait,h) &=& p(h|wait,h)\cdot p(r|wait,h) &=& 1 \cdot \rho_{h|wait,h} \\ p(h,r|wait,l) &=& p(h|wait,l)\cdot p(r|wait,l) &=& 0 \\ p(h,0|wait,h) &=& p(h|wait,h)\cdot p(0|wait,h) &=& 1 \cdot (1-\rho_{h|wait,h}) \\ p(h,0|wait,l) &=& p(h|wait,l)\cdot p(0|wait,l) &=& 0 \\ \\ p(l,r|wait,h) &=& p(l|wait,h)\cdot p(r|wait,h) &=& 0 \\ p(l,r|wait,l) &=& p(l|wait,l)\cdot p(r|wait,l) &=& 1 \cdot \rho_{l|wait,l} \\ p(l,0|wait,h) &=& p(l|wait,h)\cdot p(0|wait,h) &=& 0 \\ p(l,0|wait,l) &=& p(l|wait,l)\cdot p(0|wait,l) &=& 1 \cdot (1-\rho_{l|wait,l}) \\ \\ \end{array}\]

Finally for action \(a=search\), we set:

\[\begin{array}{lll ll} p(h,r|search,h) &=& p(h|search,h)\cdot p(r|search,h) &=& alpha \cdot \rho_{h|search,h} \\ p(h,r|search,l) &=& p(h|search,l)\cdot p(r|search,l) &=& (1-beta) \cdot \rho_{h|search,l} \\ p(h,0|search,h) &=& p(h|search,h)\cdot p(0|search,h) &=& alpha \cdot (1-\rho_{h|search,h}) \\ p(h,0|search,l) &=& p(h|search,l)\cdot p(0|search,l) &=& (1-beta) \cdot (1-\rho_{h|search,l}) \\ \\ p(l,r|search,h) &=& p(l|search,h)\cdot p(r|search,h) &=& (1-alpha) \cdot \rho_{l|search,h} \\ p(l,r|search,l) &=& p(l|search,l)\cdot p(r|search,l) &=& beta \cdot \rho_{l|search,l} \\ p(l,0|search,h) &=& p(l|search,h)\cdot p(0|search,h) &=& (1-alpha) \cdot (1-\rho_{l|search,h}) \\ p(l,0|search,l) &=& p(l|search,l)\cdot p(0|search,l) &=& beta \cdot (1-\rho_{l|search,l}) \\ \\ \end{array}\]

Putting all this together we get the following two equations, one for each state \(h\) and \(l\).
The equation for \(s=h\):

\[ \begin{array}{lllll} v_{k+1}(h) &:=& \pi(load,h)& \cdot\bigg( \\ &&& 0\cdot[r+\gamma\cdot v_k(h)] \\ &&+& 0\cdot[0+\gamma\cdot v_k(h)] \\ &&+& 0\cdot[r+\gamma\cdot v_k(l)] \\ &&+& 0\cdot[0+\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(wait,h)& \cdot\bigg( \\ &&& 1 \cdot \rho_{h|wait,h}\cdot[r+\gamma\cdot v_k(h)] \\ &&+& 1 \cdot (1-\rho_{h|wait,h})\cdot[0+\gamma\cdot v_k(h)] \\ &&+& 0\cdot[r+\gamma\cdot v_k(l)] \\ &&+& 0\cdot[0+\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(search,h)& \cdot\bigg( \\ &&& alpha \cdot \rho_{h|search,h}\cdot[r+\gamma\cdot v_k(h)] \\ &&+& alpha \cdot (1-\rho_{h|search,h})\cdot[0+\gamma\cdot v_k(h)] \\ &&+& (1-alpha) \cdot \rho_{l|search,h}\cdot[r+\gamma\cdot v_k(l)] \\ &&+& (1-alpha) \cdot (1-\rho_{l|search,h})\cdot[0+\gamma\cdot v_k(l)] \bigg)\,, \\ \forall s\in\{h,l\}. \\ \end{array} \]

…and the equation for \(s=l\): \[ \begin{array}{lllll} v_{k+1}(l) &:=& \pi(load,l)& \cdot\bigg( \\ &&& p(h|load,l)\cdot 0\cdot[r+\gamma\cdot v_k(h)] \\ &&+& p(h|load,l)\cdot 1\cdot[0+\gamma\cdot v_k(h)] \\ &&+& 0\cdot[r+\gamma\cdot v_k(l)] \\ &&+& 0\cdot[0+\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(wait,l)& \cdot\bigg( \\ &&& 0\cdot[r+\gamma\cdot v_k(h)] \\ &&+& 0\cdot[0+\gamma\cdot v_k(h)] \\ &&+& 1 \cdot \rho_{l|wait,l}\cdot[r+\gamma\cdot v_k(l)] \\ &&+& 1 \cdot (1-\rho_{l|wait,l})\cdot[0+\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(search,l)& \cdot\bigg( \\ &&& (1-beta) \cdot \rho_{h|search,l}\cdot[r+\gamma\cdot v_k(h)] \\ &&+& (1-beta) \cdot (1-\rho_{h|search,l})\cdot[0+\gamma\cdot v_k(h)] \\ &&+& beta \cdot \rho_{l|search,l}\cdot[r+\gamma\cdot v_k(l)] \\ &&+& beta \cdot (1-\rho_{l|search,l})\cdot[0+\gamma\cdot v_k(l)] \bigg)\,, \\ \forall s\in\{h,l\}. \\ \end{array} \] We then simplify these equations.
The equation for \(s=h\):

\[ \begin{array}{lllll} v_{k+1}(h) &:=& \pi(wait,h)& \cdot\bigg( \\ &&& \rho_{h|wait,h}\cdot[r+\gamma\cdot v_k(h)] \\ &&+& (1-\rho_{h|wait,h})\cdot[\gamma\cdot v_k(h)] \bigg) \\ &+ \\ &&\pi(search,h)& \cdot\bigg( \\ &&& alpha \cdot \rho_{h|search,h}\cdot[r+\gamma\cdot v_k(h)] \\ &&+& alpha \cdot (1-\rho_{h|search,h})\cdot[\gamma\cdot v_k(h)] \\ &&+& (1-alpha) \cdot \rho_{l|search,h}\cdot[r+\gamma\cdot v_k(l)] \\ &&+& (1-alpha) \cdot (1-\rho_{l|search,h})\cdot[\gamma\cdot v_k(l)] \bigg)\,, \\ \forall s\in\{h,l\}. \\ \end{array} \] The equation for \(s=l\): \[ \begin{array}{lllll} v_{k+1}(l) &:=& \pi(load,l)& \cdot\bigg( \\ &&& [\gamma\cdot v_k(h)] \bigg) \\ &+ \\ &&\pi(wait,l)& \cdot\bigg( \\ &&& \rho_{l|wait,l}\cdot[r+\gamma\cdot v_k(l)] \\ &&+& (1-\rho_{l|wait,l})\cdot[\gamma\cdot v_k(l)] \bigg) \\ &+ \\ &&\pi(search,l)& \cdot\bigg( \\ &&& (1-beta) \cdot \rho_{h|search,l}\cdot[r+\gamma\cdot v_k(h)] \\ &&+& (1-beta) \cdot (1-\rho_{h|search,l})\cdot[\gamma\cdot v_k(h)] \\ &&+& beta \cdot \rho_{l|search,l}\cdot[r+\gamma\cdot v_k(l)] \\ &&+& beta \cdot (1-\rho_{l|search,l})\cdot[\gamma\cdot v_k(l)] \bigg)\,, \\ \forall s\in\{h,l\}. \\ \end{array} \]

Implementation

We implement the algorithm for the uniform policy (recall that the policy is determined by the variables \(aa\), \(bb\) and \(cc\)):

# parameters related to the policy
aa <- 1/2
bb <- 1/3
cc <- 1/3
pi_wait_h <- aa
pi_sear_h <- 1-aa
pi_load_l <- bb
pi_wait_l <- cc
pi_sear_l <- 1-bb-cc

# parameters related to the environment itself
alpha <- 0.3
beta <- 0.4
gamma <- 0.9  # the discount factor of the expected return
## probabilities of getting a reward.
## There is here no sum-to-one constraint at all!
rho_h.wait_h <- 0.1 #little probability for getting the reward
rho_h.sear_h <- 0.1 #little probability for getting the reward
rho_l.sear_h <- 0.9 #large probability for getting the reward
rho_l.wait_l <- 0   #zero probability for getting the reward
rho_h.sear_l <- 1   #certitude of getting A LARGE NEGATIVE reward
rho_l.sear_l <- 0.2
r <- 1              # normal reward
r_stuck <- -3       # large negative reward
# initialization step of the recursive algorithm
v <- vector("numeric",2)  # vector v=:(v[h],v[l])
v[1] <- 0
v[2] <- 0

w <- function(v,s,policy_vector){
# recursive step of the algorithm
#
# input:  > s, one of the two possible state h and l
#         > v, a 2-dimensional vector v=:(v[h],v[l]), meant to be an approximation
#            for the true state-value vector (v_stern[h],v_stern[l])
#         > policy_vector, a vector of three probabilities defining a policy.
#            Let's write policy_vector as (aa,bb,cc). Then:
#                     aa stands for p(wait | h), where wait is the state-action
#                     node (wait,h),
#                     bb stands for p(load | l),
#                     cc stands for p(wait | l).
# output: an improved approximation vector v after a next iteration step
# --------------------------------------

  # parameters related to the policy
  aa <- policy_vector[1]
  bb <- policy_vector[2]
  cc <- policy_vector[3]
  pi_wait_h <- aa
  pi_sear_h <- 1-aa
  pi_load_l <- bb
  pi_wait_l <- cc
  pi_sear_l <- 1-bb-cc

  w_h <- pi_wait_h * (rho_h.wait_h * (r + gamma * v[1])
                + (1-rho_h.wait_h) * (gamma * v[1]))
          + pi_sear_h * (alpha * rho_h.sear_h * (r + gamma * v[1])
                + alpha * (1-rho_h.sear_h) * (gamma * v[1])
                + (1-alpha) * rho_l.sear_h * (r + gamma * v[2])
                + (1-alpha) * (1-rho_l.sear_h) * (gamma * v[2]))
  
  w_l <- pi_load_l*(gamma*v[1])
        + pi_wait_l*(rho_l.wait_l*(r+gamma*v[2])
            + (1-rho_l.wait_l)*(gamma*v[2]))
        + pi_sear_l*((1-beta)*rho_h.sear_l*(r+gamma*v[1])
            + (1-beta)*(1-rho_h.sear_l)*(gamma*v[1])
            + beta*rho_l.sear_l*(r+gamma*v[2])
            + beta*(1-rho_l.sear_l)*(gamma*v[2]))

  if(s == "h"){
    v[1] <- w_h
  }
  else if(s == "l"){
    v[2] <- w_l
  }
  else{
    cat("ERROR")
  }
  return(v)
}
w(v,"h",c(1/2,1/3,1/3))
## [1] 0.05 0.00
norm_vec <- function(x) sqrt(sum(x^2))

# initialization step of the recursive algorithm
v <- vector("numeric",2)  # vector v=:(v[h],v[l])
v[1] <- 0
v[2] <- 0

# policy evaluation
v_old <- v+1
threshold <- 10^(-8)  # calculation precision
while(norm_vec(v-v_old)>threshold){
  v_old <- v
  for(s in c("h","l")){
    v <- w(v,s,c(1/2,1/3,1/3))
  }
  print(v)
}
## [1] 0.050 0.015
## [1] 0.07250 0.02175
## [1] 0.0826250 0.0247875
## [1] 0.08718125 0.02615438
## [1] 0.08923156 0.02676947
## [1] 0.09015420 0.02704626
## [1] 0.09056939 0.02717082
## [1] 0.09075623 0.02722687
## [1] 0.09084030 0.02725209
## [1] 0.09087814 0.02726344
## [1] 0.09089516 0.02726855
## [1] 0.09090282 0.02727085
## [1] 0.09090627 0.02727188
## [1] 0.09090782 0.02727235
## [1] 0.09090852 0.02727256
## [1] 0.09090883 0.02727265
## [1] 0.09090898 0.02727269
## [1] 0.09090904 0.02727271
## [1] 0.09090907 0.02727272
## [1] 0.09090908 0.02727272
## [1] 0.09090909 0.02727273

We have obtained the state values \(v_\pi(h)=0.0909\) and \(v_\pi(l)=0.02727\) for the given policy, here the uniform policy, which consists in the agent making each possible choice with equal probability.

The recycling robot:
Policy improvement

One reason for computing the value function for a given policy is to help find better policies. Suppose we have determined the value function \(v_\pi\) for an arbitrary deterministic policy \(\pi\). For some state \(s\) we would like to know whether or not we should change the policy to deterministically choose an action \(a\neq\pi(s)\). We know how good it is to follow the current policy from \(s\), namely \(v_\pi(s)\), but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting action \(a\) when in state \(s\) and thereafter following the existing policy \(\pi\). The value of this way of behaving is:

\[ \begin{eqnarray*} q_\pi(s,a) &:=& E_\pi[R_{t+1}+\gamma\cdot v_\pi(S_{t+1}) \,|\, S_t=s,A_t=a] \\ &=& \sum_{s',r}p(s',r \,|\, s,a)\cdot[r+\gamma\cdot v_\pi(s')]. \end{eqnarray*} \]

One has to verify whether this is greater than or less than \(v_\pi(s)\). If this is greater, it means it is better to select action \(a\) once in state \(s\) and thereafter follow policy \(\pi\) than it would be to follow \(\pi\) all the time.

Policy improvement theorem
For any pair \((\pi,\pi')\) of deterministic policies such that, \[ q_\pi(s,\pi'(s))\geq v_\pi(s), \quad\forall s\in\mathcal{S}, \] the policy \(\pi'\) is as good as, or better than, \(\pi\). That is \[ v_{\pi'}(s)\geq v_\pi(s), \quad\forall s\in\mathcal{S}. \]

Note: in the general, stochastic, case, a stochastic policy \(\pi\) specifies probabilities \(\pi(a|s)\) for taking each action \(a\) when in state \(s\). The policy improvement theorem holds not only for deterministic but also for stochastic policies (we state this here without proof).

Policy iteration

Once a policy \(\pi\) has been improved using \(v_\pi\) to yield a better policy \(\pi'\), we can then compute \(v_{\pi'}\) and improve it again to yield an even better \(\pi''\). We can thus obtain a sequence of improving policies and value functions:

\[ \pi_0 \longrightarrow v_{\pi_0} \longrightarrow \pi_1 \longrightarrow v_{\pi_1} \cdots \longrightarrow \pi_* \longrightarrow v_{\pi_*} \] This way of obtaining an optimal policy is called policy iteration.

We first recall how to evaluate a given policy, as already developed above:

#  Policy evaluation
policy_evaluation <- function(policy_vector){
  # Initialization
  v <- vector("numeric",2)  # vector v=:(v[h],v[l])
  v[1] <- 0
  v[2] <- 0
  
  v_old <- v+1
  threshold <- 10^(-8)  # calculation precision
  while(norm_vec(v-v_old)>threshold){
    v_old <- v
    for(s in c("h","l")){
      v <- w(v,s,policy_vector)
    }
  }
  return(v)
}

Notice that the function policy_evaluation() can evaluate stochastic policies, for example the uniform policy, which consists in the agent always choosing with equal probability among all the actions available in the current state, according to the MDP:

policy_evaluation(c(1/2,1/3,1/3))
## [1] 0.09090909 0.02727273

which means that the state value of state “h” is 0.090909 and that the state value of state “l” is 0.0272727.

The algorithm for policy improvement we want to implement works with deterministic policies. That means that only 0/1 values are possible for the elements of the vector policy_vector. In fact, the only possible vector values for policy_vector are the following: (1,1,0), (1,0,1), (1,0,0), (0,1,0), (0,0,1), (0,0,0).
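As a quick side check (not part of the policy improvement algorithm itself), all six deterministic policies can be evaluated at once with the policy_evaluation() function defined above:

deterministic_policies <- list(c(1,1,0), c(1,0,1), c(1,0,0),
                               c(0,1,0), c(0,0,1), c(0,0,0))
# one column of state values (v[h], v[l]) per deterministic policy
sapply(deterministic_policies, policy_evaluation)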

We next define an important function, one which determines, for a state \(s\), the arg max over all actions of the quantity to be maximized, i.e. of \(q_\pi(s,a)\) computed from the current state values. It is the core of the algorithm for policy improvement:

arg_max <- function(s,policy_vector){
  # calculation of the improved policy for state s
  #
  # input:  > s, one of the two possible state h and l
  #         > policy_vector: one among 6 combinations (1,0,0), (1,1,0),...
  #              defining deterministic instead of stochastic policies
  # output: > a new policy_vector among these 6 possible policies
  # --------------------------------------
  
  # parameters related to the policy
  aa <- policy_vector[1]
  bb <- policy_vector[2]
  cc <- policy_vector[3]
  pi_wait_h <- aa
  pi_sear_h <- 1-aa
  pi_load_l <- bb
  pi_wait_l <- cc
  pi_sear_l <- 1-bb-cc  
  
  v <- policy_evaluation(policy_vector)  # state values (v[h], v[l]) under the given policy
    
  if(s == "h"){
    linie1 <- list("load",0)
    linie2 <- list("wait",(rho_h.wait_h * (r + gamma * v[1])
                + (1-rho_h.wait_h) * (gamma * v[1])))
    linie3 <- list("search",(alpha * rho_h.sear_h * (r + gamma * v[1])
                + alpha * (1-rho_h.sear_h) * (gamma * v[1])
                + (1-alpha) * rho_l.sear_h * (r + gamma * v[2])
                + (1-alpha) * (1-rho_l.sear_h) * (gamma * v[2])))
    dd <- as.data.frame(linie1)
    dd <- rbind(dd,linie2,linie3)
    out <- dd[dd[2]==max(dd[2])]
    return(out[1])
  
  } else if(s == "l"){
    linie1 <- list("load",(gamma*v[1]))
    linie2 <- list("wait",(rho_l.wait_l*(r+gamma*v[2])
            + (1-rho_l.wait_l)*(gamma*v[2])))
    linie3 <- list("search",((1-beta)*rho_h.sear_l*(r+gamma*v[1])
            + (1-beta)*(1-rho_h.sear_l)*(gamma*v[1])
            + beta*rho_l.sear_l*(r+gamma*v[2])
            + beta*(1-rho_l.sear_l)*(gamma*v[2])))
    dd <- as.data.frame(linie1)
    dd <- rbind(dd,linie2,linie3)
    out <- dd[dd[2]==max(dd[2])]
    return(out[1])
  
  } else{
    cat("ERROR")
  }
}

arg_max("l",c(0,0,0))
## [1] "search"

We then define the recursive step of the algorithm:

new_policy_step <- function(s,policy_vector){
# recursive step of the algorithm
#
# input:  > s, one of the two possible state h and l
#         > policy_vector, a vector of three probabilities defining a policy.
#            Let's write policy_vector as (aa,bb,cc). Then:
#                     aa stands for p(wait | h), where wait is the state-action
#                     node (wait,h),
#                     bb stands for p(load | l),
#                     cc stands for p(wait | l).
# output: an improved policy_vector after a further iteration step
# --------------------------------------

  if(s == "h" & arg_max(s,policy_vector) == "load"){
      cat("ERROR: state-action node (load|h) doesn't exist")
      
  } else if(s == "h" & arg_max(s,policy_vector) == "wait"){
    new_policy <- policy_vector
    new_policy[1] <- 1
      
  } else if(s == "h" & arg_max(s,policy_vector) == "search"){
    new_policy <- policy_vector
    new_policy[1] <- 0
      
  } else if(s == "l" & arg_max(s,policy_vector) == "load"){
    new_policy <- policy_vector
    new_policy[2] <- 1
    new_policy[3] <- 0
      
  } else if(s == "l" & arg_max(s,policy_vector) == "wait"){
    new_policy <- policy_vector
    new_policy[2] <- 0
    new_policy[3] <- 1
      
  } else if(s == "l" & arg_max(s,policy_vector) == "search"){
    new_policy <- policy_vector
    new_policy[2] <- 0
    new_policy[3] <- 0
      
  } else{cat("ERROR2")}
  return(new_policy)
}

policy <- c(0,0,0)  # This is the policy defined by p(search|h)=1 and p(search|l)=1,
                    # that is: the agent always chooses the action "search".
new_policy_step("h",policy)
## [1] 0 0 0

Finally we write the algorithm for policy improvement:

# Let's consider a policy that we want to be improved.
policy <- c(0,0,0)  # This is the policy defined by p(search|h)=1 and p(search|l)=1,
                    # that is: the agent always chooses the action "search".

policy_improvement <- function(policy_vector){
  policy_stable <- TRUE
  old_policy <- vector("numeric",3)

  for(s in c("h","l")){
    old_policy <- policy
    policy <- new_policy_step(s,policy)
    if(identical(old_policy,policy)){
      policy_stable <- FALSE
    }
  }
  if(policy_stable == TRUE){
  return(policy)
  }else{
    cat("ERROR3")}
}

For the purpose of illustrating the algorithm, we now improve a given deterministic policy, say the policy that makes the agent choose the action “search” no matter what the current state is, i.e. the policy described by the vector \((0,0,0)\):

policy <- c(0,0,0)
policy_improvement(policy)
## ERROR3